To the editor, SIL Notes on Computing,

A month or so ago I returned from Nigeria where I spent 11 months on the Wycliffe UK GRIP programme, working on a dictionary project in an existing Bible translation and literacy project. Alongside making our large dictionary I developed some software and a process to allow multiple people in multiple locations to modify the same Shoebox database at the same time.

I developed the software for our own needs, but Doug Higby suggested I share the work with other SIL members. The working methodology is outlined below, but the software would need to be distributed somehow. I don't know how this would work best: whether I should put a version on the Web somewhere to link to from the article, or whether SIL could host the software. The core program is no more than 200K in size, but people may also need a Java installer, the documentation and the (Java) source files, which bring the total into the 20MB region (mostly due to the Java Runtime Environment installer). So, if you could advise me what to do before the article is submitted, I'd be grateful.

Also, I welcome any suggestions you may have about the content and presentation of the article below. I've written it in plain text to make transfer by email more straightforward, but would happily give you an HTML or MS Word version. The software has a manual written in Word, which is also available as a PDF.


------------------------------------------------------------------------
Plain-text Article:

[Note - no location for downloading/obtaining ShoeShop is currently given]

ShoeShop and Merging Concurrent/Collaborative Shoebox Linguistic Databases

David Rowbory, Wycliffe Scotland

To the members of SIL

This article is particularly for anyone working in a team situation with Shoebox databases, especially for dictionary projects. We describe how we developed a large dictionary as a team using Shoebox, with each of us updating a copy of the same Shoebox database and later merging our changes intelligently and semi-automatically. Both the methodology and the software required for this are described. We are aware that some aspects of our project were unusual, but believe that this opens up exciting new possibilities for novel uses of Shoebox in the development of lexicons and dictionaries in many team situations.


1. Background: The Difficulties of Distributing Work on a Shoebox Database

From January to December 2001 I and another Wycliffe UK short-termer (Matthew Earwicker) worked on the C'Lela Dictionary Project in northern Nigeria. One local man had collected around 7000 words for a dictionary and our task was to get the dictionary on computer and formatted. The data were all ready for us on 7000 paper index cards, so in our few months in Nigeria we were able to type, check and lay out this fairly large dictionary. Shoebox was a very good tool for us. It was very efficient even with such a great pile of data. However, we were dogged by the assumption and limitation that only one person could work on a database/lexicon at once.

1.1: A partial solution

This meant that either we would have to take it in turns to type index cards into the database, or some other solution would have to be found. We took the latter approach, and based our interim solution on creating a new database for each week of work. In a week we would type in around 500 entries each independently of each other, and at the end of the week, merge our databases using Shoebox's useful "merge" command. This "Merge" command simply joins two databases together. We would then save the merged database as a week's work, archive it and in turn merge that into our 'current full' lexicon file. Thus we always kept backups of each week's work in addition to having a full version ready.

1.2: Weaknesses in this solution

However, there were several drawbacks to this. Duplicate entries were only discovered much later when we painstakingly trawled through the entire database looking for them. Had we been typing the entries into the current full database then Shoebox would have alerted us that a similar record already existed. Likewise, working in fragments of a database at a time we were limiting our view of the database as a whole. We were usually unable to use Shoebox's link consistency checks on data we were entering; all links had to be checked later, when the lexicon was complete. Also, we could not tell if accidentally we had both typed in the same dictionary entry, or even a whole pack of 100 entries. Those would just show up later as duplicates. And finally, once we had completed the typing-in stage, again only one person could be allowed to make a change to the full lexicon database.

Shoebox experts might be able to suggest better ways of using the existing software to support multiple users (perhaps intricate systems of splitting the database into fragments and using jump paths to link them weakly, then some fragments would be for one user to edit and the rest for another user to edit). However we were spurred on to finding a more robust solution.

2: Inspiration from Mali and from Computer Science

We had a colleague - another GRIPper called Jonathan Burden - working with Doug Higby in Mali doing a dictionary project with potentially more contributors than we had. He wrote asking how Matt and I worked on our dictionary at the same time, given the 'just add together' way that Shoebox's merge command works.

He wrote:
> What I would like to do is send different people away to work on the
> same data in shoebox and then merge all of the notes and variants
> they make such that then a choice can be made by a committee as to
> which variants etc. to accept. As it is,  I have to key in all
> revisions, corrections etc. myself, as I am the only one with the up
> to date database.

2.1: Pessimistic Revision Control (optional, technical)

This reminded me of an age-old problem in software development, where you often have a large body of computer code updated by many different people. Somehow it must be possible to let more than one person work on the same project at once without creating chaos. This is known as Revision Control, and generally helps the software developers keep track of who changed what where (and maybe why). In the simplest scheme, only one person at a time gets to change a given part of a project, so that it is impossible for two people to change the same thing. That's Pessimistic Revision Control, and it works for projects consisting of many small files.

However, we had one big file we wanted to deal with. Also, some Revision Control systems rely on all users being connected to a common network in order to manage who can 'check out' which file. In the remote locations where SIL people work, it is unlikely that two researchers will be connected by a network. They may even be a day's journey apart.


3: Optimistic Merge Control

A better solution is hinted at by computer science researchers in Sweden. [FOOTNOTE: I uncovered this while planning my final year computer science project just before graduating and going on GRIP. Department of Computer Science, Lund Institute of Technology, Lund University, P.O. Box 118, S-221 00 Lund, Sweden has published some useful articles on "Fine-Grained Revision Control for Collaborative Software Development".] If we expect people generally not to try to change the same thing, we can allow everyone write access to their own copies of an original document, and later merge the divergent copies in a controlled way.

In the Shoebox database scenario, we allow two (or more) people to update the same core database, resulting in both new copies diverging somewhat from the original. Then we could look at how the two updated databases compare and attempt an intelligent merge.

Rather than just sticking the two updated databases together, we examine what each user did to each record of the original. Then records which no-one changed stay the same and records which one person updated are updated. Only if two people changed the same record do we need to decide what to do. Such a decision (do these changes conflict with one another?) is easy for people to make, but non-trivial for computers. However, trawling manually through large databases is best done by computers, not people.

We will quite likely find that different people have updated different entries, so we may be able to accept everyone's changes. We just need to tell the computer how to identify conflicting changes (or potential conflicts) and then ask a human to arbitrate in those situations.
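The decision rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not ShoeShop's actual code (ShoeShop is written in Java): the function names are invented here, each database is assumed to be already reduced to a dictionary from a record key to the record text, and a record that is absent is represented by None. Treating two identical edits as "no conflict" is also an assumption of this sketch.

```python
def merge_record(base, ours, theirs):
    """Decide the fate of one record; return (merged_value, is_conflict).

    Each argument is the record's text in the original and in the two
    edited copies, or None when the record does not exist there.
    """
    if ours == theirs:
        return ours, False      # nobody changed it, or both made the same change
    if ours == base:
        return theirs, False    # only the second editor changed it
    if theirs == base:
        return ours, False      # only the first editor changed it
    return None, True           # both changed it differently: ask a human


def three_way_merge(base, ours, theirs):
    """Merge two edited copies of `base`; return (merged, conflicting_keys)."""
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        value, is_conflict = merge_record(
            base.get(key), ours.get(key), theirs.get(key)
        )
        if is_conflict:
            conflicts.append(key)   # left out of `merged` until arbitrated
        elif value is not None:
            merged[key] = value     # None here means the record was deleted
    return merged, conflicts
```

Only the keys returned in the conflict list need human attention; everything else is merged without anyone having to read it.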

From this we can see there are two parts to the solution: methodology and software.

3.1: The Methodology

The methodology can be summarised thus:
1. take one Shoebox database - the 'master', or 'base'
2. each editor takes a copy and updates it so that it diverges from the original
3. all divergent copies are brought together and merged
4. the merged database is now treated as authoritative and is the new master

3.2: The Merging Software (ShoeShop)

The software just speeds up the merging process (3) and makes the methodology worthwhile. If the methodology is not followed as above then the software described below will not be effective. ShoeShop performs three tasks:
1. Calculates the changes made by each user updating the 'master'
2. Finds out which of these conflict with each other and requests human arbitration.
3. Applies all acceptable changes from all users to the original 'master' resulting in the new, merged database.

What is a change? What is a conflict? This depends on the data being merged, but in our case we were dealing with lexical entries for a dictionary. We treated each lexical entry (record) separately. A change was the addition, deletion or alteration of a Shoebox record. Using MDF, we assumed the Shoebox field \lx heralded the start of a new record. We considered that two changes conflicted if it looked as if the same (or a very similar) lexical entry was being added, removed or changed.
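To make the notion of a "change" concrete, here is a minimal Python sketch of splitting a plain-text Shoebox file into records at each \lx line and classifying what one editor did. It is not taken from ShoeShop: the function names are hypothetical, anything before the first \lx line is simply treated as header material, and records are matched on the exact headword only, which ignores both homonyms and "very similar" entries.

```python
RECORD_MARKER = "\\lx"  # MDF lexeme field, taken as the start of each record


def split_records(text, marker=RECORD_MARKER):
    """Split Shoebox plain text into (header_lines, list_of_record_texts)."""
    header, records, current = [], [], None
    for line in text.splitlines():
        if line == marker or line.startswith(marker + " "):
            current = [line]            # a new record begins here
            records.append(current)
        elif current is None:
            header.append(line)         # material before the first record
        else:
            current.append(line)
    return header, ["\n".join(lines).rstrip() for lines in records]


def headword(record, marker=RECORD_MARKER):
    """Return the contents of the record's first (\\lx) line."""
    return record.splitlines()[0][len(marker):].strip()


def diff_records(base_text, edited_text):
    """Classify one editor's changes as additions, deletions or alterations."""
    _, base_records = split_records(base_text)
    _, edited_records = split_records(edited_text)
    base = {headword(r): r for r in base_records}
    edited = {headword(r): r for r in edited_records}
    return {
        "added": sorted(k for k in edited if k not in base),
        "deleted": sorted(k for k in base if k not in edited),
        "altered": sorted(
            k for k in base if k in edited and base[k] != edited[k]
        ),
    }
```

Running diff_records once per editor against the same master gives one change list per person; two changes are candidates for conflict when they touch the same headword.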

ShoeShop began with a very simplistic conflict-detection algorithm which flagged more conflicts than necessary. The algorithm has evolved but is still fairly simple. Having detected a conflict, it offers the user five choices: to accept one person's change, to accept the other's, to accept neither, to supply an authoritative version or (only in certain circumstances where it may make sense) to accept both changes. Care is taken over detecting whether two changes involve the same record, since in the case of homonyms several different records may share the same contents of the \lx field.
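One way to keep homonyms apart when matching records is to give each record a key that is more specific than its headword. The Python sketch below is an assumption about how this could be done, not a description of ShoeShop's method: it uses the MDF homonym number field \hm when a record has one, and otherwise falls back to counting how many records with the same headword have been seen so far. The function names are invented for illustration.

```python
from collections import Counter


def field(record, marker):
    """Return the contents of the first line starting with `marker`, or None."""
    for line in record.splitlines():
        if line == marker or line.startswith(marker + " "):
            return line[len(marker):].strip()
    return None


def record_keys(records):
    """Return one (headword, homonym_id) key per record, in file order."""
    seen = Counter()
    keys = []
    for record in records:
        lx = field(record, "\\lx") or ""
        hm = field(record, "\\hm")      # assumed: MDF homonym number, if present
        if hm is None:
            seen[lx] += 1
            hm = "#%d" % seen[lx]       # fallback: position among same headwords
        keys.append((lx, hm))
    return keys
```

The positional fallback is fragile: if one editor deletes the first of two unnumbered homonyms, the second one shifts position, so such cases are better flagged for human arbitration than merged silently.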

3.2.1: A note on more manual merging techniques and debugging ShoeShop

We found that several software tools, notably Metrowerks CodeWarrior, include graphical file comparison tools which are useful for merging Shoebox databases manually. These tools allow you to compare the contents of two similar files, line by line, to identify what has changed.

3.2.2: More about the software

The ShoeShop program was only possible because Shoebox databases are in plain text format, which is easy for other programs to manipulate. The ShoeShop program is available from XXXXXXXXXXXXXXXXXXXX.

ShoeShop comes with documentation explaining the workings of the program in detail with screen shots. It was important to me that the program was cross-platform and open-source, so it was written in Java, and the source should be distributed with the program. This should make it easier for others to see what I do and how, and to correct bugs and implement improvements themselves. No licensing has been arranged yet, so all software should be assumed to be Copyright (C) 2001 David Rowbory All Rights Reserved, although I give permission for any SIL member to use it as they wish, provided no modified versions are distributed without checking with me.

Why the name ShoeShop? Well, I needed to give the project some name, and it seemed that it would involve many Shoeboxes coming together, kind of like in a shoe shop. And there's a good tradition in naming things somethingShop (PhotoShop, VideoShop etc.)

3.2.3: Limitations of ShoeShop

ShoeShop requires large amounts of RAM to run. Our 1.5MB Shoebox databases usually required about 15MB RAM each to be processed, meaning we needed over 64MB RAM on the merge machine.

The requirement of a Java Runtime Environment is troublesome to many. It makes ShoeShop more awkward to use and slower than a normal C++ program, but it is a lot more portable and more stable.

ShoeShop assumes you are using MDF and the lexeme field as record marker. This is a fairly arbitrary assumption and could be removed. But that was never going to be necessary for our needs.

The display is far from perfect. It would be ideal if ShoeShop picked up information on fonts etc from Shoebox but it doesn't. Fonts are problematic under Windows especially. They were working reasonably for us by the time we left, but I haven't documented exactly how to specify which field uses which font. The display of changes within ShoeShop needs a lot of work to be useful.

The design of the ShoeShop software assumes that all updated databases have derived from the same original. This is just so that we can calculate the changes made. It would be possible for each update to have an associated base file, but this does not seem particularly useful. An alternative to this time-consuming comparison process might be to record changes in some other way, either as the user makes them or when he saves the database.

ShoeShop does not allow records to be merged except by making an authoritative override change and doing the merge manually. Ideally ShoeShop would be able to look at the way two people have changed the same record and work out whether there is any need to ask the user about it, or whether it can simply go ahead and combine both users' changes in that record.


4: Case Studies

4.1: How we used ShoeShop in the C'Lela Dictionary Project

We used the ShoeShop software from its early days as we were conducting extensive checking and revision of the dictionary database. Only once very early on did it make a mess trying to merge the databases, and since it does not overwrite any data, that was no disaster. It seems to be reliable and stable.

Right until the day we printed master copies of the fully-formatted dictionary, we were using ShoeShop as an integral part of our work. My colleague Matthew and I would revise different aspects of the dictionary as we saw fit, and we merged databases once or twice a week so we kept fairly consistent with each other. This meant that we always had access to the whole database for manual and automatic consistency-checking and for checking words in the example sentences.

Usually when we merged there would be a handful of conflicts: perhaps 10 out of an average of 400 changes to records. Once or twice use of the Shoebox 'Replace All' feature meant that we unwittingly caused hundreds of conflicts. We learnt to save 'Replace All' changes for the window just after merging and before distribution of a new master database.

In summary, we can say that the ShoeShop software and its accompanying methodology saved us time and made it possible for us to work on the dictionary data in ways we had not anticipated.

4.2: How a distributed language team might use ShoeShop

Imagine two mother-tongue translators working with an expatriate advisor on a Bible translation project. Another expatriate is beginning a literacy programme together with a native speaker interested in literacy development and, in particular, in building a dictionary for the language. There are also several other people who would be well qualified to help with the dictionary and translation projects, but they live far apart.

Right from their small beginnings, the translation team have developed a Shoebox lexicon containing some of the words in the translated scriptures. Both translators, as well as the advisor, are fairly computer-literate and able to work with Shoebox to amend the lexicon, especially as key biblical terms are discussed and decided upon for various parts of the translation work.

Sometimes the translators go to outlying areas to check comprehensibility of the translation with other dialects, and the lexicon needs to be updated with information on which words cause difficulties in different dialects, and which ones are common.

The people developing the dictionary also need to amend the lexicon. In their case, they are rapidly adding new entries for wildlife and cultural items which have not occurred in the translation work.

Until they began to use ShoeShop, someone (the expatriate involved with literacy) had to take complete ownership of the lexicon, and all changes had to be submitted to her, usually on paper. Amending the lexicon was thus rather time-consuming and frustrating for all concerned.

Now the whole team tries to meet once every 1 or 2 weeks to consolidate their work on the lexicon. Everyone brings their updated lexicon files together, and they are merged using ShoeShop. There are few conflicts - often none at all, especially since everyone is working on the lexicon in such different ways - and the merging process takes only about 5 minutes. The new lexicon file is given out to everyone again, replacing their old one, and all the old files are backed up onto CD, along with the new master, just in case of disaster.

All the workers are then at liberty to update their lexicon file as necessary, and a week or so later, their updates will be incorporated with everyone else's into a new master lexicon. Translators on checking sessions far away from the base are able to record their findings and update their lexicon files immediately.

Other people from far away who are interested in the dictionary, and who have some computer skills, are given a copy of the lexicon to review and edit certain aspects of it. They then send back their updated lexicon files. Before merging their changes, the person in charge of the dictionary project can view each of the changes made by these external editors. [There is no way as yet to accept or reject individual changes which don't conflict with any other change, but that would be fairly easy to add to the program.]

Even when everyone is working in the same office, they can all have write access to their own copy of the lexicon file at the same time, so the old regulations about no-one changing the lexicon file except the literacy co-ordinator are long gone.


5: Acknowledgements

I am greatly indebted to Matthew Earwicker for his patient and helpful testing and suggestions, which helped guide me as I built ShoeShop. I am also grateful to Jonathan Burden for the original inspiration and Doug Higby for his encouragement and suggestion of sharing this information abroad. We owe a lot to the authors of Shoebox (we used version 5), without which our dictionary would not have been and ShoeShop would have been impossible. The stability and efficiency of Shoebox are impressive. And of course I am most grateful indeed to God for watching over us, and giving me such a good year in northern Nigeria.


6: In Conclusion

ShoeShop is fairly simple cross-platform software designed to extend Shoebox's capabilities for a team environment. It is completely separate from the Shoebox program and will work with data files generated by any version of Shoebox, where \lx is the record marker.

I welcome feedback on the software and the methodology outlined above. I would like to make this more accessible and more useful to SIL members if possible. I'm happy to post a CD-ROM with the software (for Mac or Windows) to anyone who would like it. 